Java: Convert all collection and array steps from taint flow to value flow.#5751
Java: Convert all collection and array steps from taint flow to value flow.#5751aschackmull merged 15 commits intogithub:mainfrom
Conversation
| "java.util.stream;SpinedBuffer;true;copyInto;(Object[],int);;Element of Argument[-1];ArrayElement of Argument[0];value", | ||
| "java.util.stream;SpinedBuffer<>$OfPrimitive;true;copyInto;(Object,int);;Element of Argument[-1];ArrayElement of Argument[0];value", | ||
| "java.util;List;false;copyOf;(Collection);;Element of Argument[0];Element of ReturnValue;value", | ||
| "java.util;List;false;of;(Object[]);;Element of Argument[0];Element of ReturnValue;value", |
There was a problem hiding this comment.
OOI, what does the empty of in ;of; mean?
There was a problem hiding this comment.
That's just the name of the varargs method List::of and not related to the DSL.
smowton
left a comment
There was a problem hiding this comment.
Reviewed up to but excluding NavigableMap
| sink.(DataFlow::ImplicitVarargsArray).getCall() = arg.getCall() | ||
| exists(Type elemType | | ||
| arrayReadStep(src, sink, elemType) and | ||
| not elemType instanceof PrimitiveType and |
There was a problem hiding this comment.
Comment on why these exclusions are made here?
There was a problem hiding this comment.
Because numbers usually don't carry taint. I.e. by default we'll say that an int[] can be tainted without implying that the individual elements are tainted, whereas when a MyObject[] is tainted then we'll take that to mean that all of the contained MyObjects are tainted as well.
There was a problem hiding this comment.
Oh sure, but I'm surprised to find this changing as part of this work? Didn't we previously permit taint to travel from x to x[0] even if x was an int[]?
There was a problem hiding this comment.
You could say that we're adding a taint step here rather than restricting one, since the general x to x[0] step is now considered an array read step that's expected to be matched by an array store step.
| "java.util;Map<>$Entry;true;setValue;;;MapValue of Argument[-1];ReturnValue;value", | ||
| "java.util;Map<>$Entry;true;setValue;;;Argument[0];MapValue of Argument[-1];value", |
There was a problem hiding this comment.
Will this pair work as expected? (the return is the old value)
There was a problem hiding this comment.
Yes, when we're targeting an argument as the output then it actually means the corresponding post-update node. So there's no overlap between these two lines, and thus should work just fine (with the obvious exception that the update isn't reflected back to the underlying map, which is a lot harder to do, since that's moving into aliasing territory).
| "java.util;Map;true;get;;;MapValue of Argument[-1];ReturnValue;value", | ||
| "java.util;Map;true;getOrDefault;;;MapValue of Argument[-1];ReturnValue;value", | ||
| "java.util;Map;true;getOrDefault;;;Argument[1];ReturnValue;value", | ||
| "java.util;Map;true;put;;;MapValue of Argument[-1];ReturnValue;value", |
There was a problem hiding this comment.
Same here I suppose -- the old value is returned, but we're presumably going to wire these steps together and alias the new one
| "java.util;Map;true;entrySet;;;MapKey of Argument[-1];MapKey of Element of ReturnValue;value", | ||
| "java.util;Map;true;get;;;MapValue of Argument[-1];ReturnValue;value", | ||
| "java.util;Map;true;getOrDefault;;;MapValue of Argument[-1];ReturnValue;value", | ||
| "java.util;Map;true;getOrDefault;;;Argument[1];ReturnValue;value", |
There was a problem hiding this comment.
Yep. I'll add it.
| "java.util;List;true;set;;;Argument[1];Element of Argument[-1];value", | ||
| "java.util;List;true;subList;;;Element of Argument[-1];Element of ReturnValue;value", | ||
| "java.util;List;true;add;(int,Object);;Argument[1];Element of Argument[-1];value", | ||
| "java.util;List;true;set;(int,Object);;Argument[1];Element of Argument[-1];value", |
There was a problem hiding this comment.
good catch, will remove
| "java.util;Queue;true;poll;();;Element of Argument[-1];ReturnValue;value", | ||
| "java.util;Queue;true;remove;();;Element of Argument[-1];ReturnValue;value", | ||
| "java.util;Queue;true;offer;(Object);;Argument[0];Element of Argument[-1];value", | ||
| "java.util;Deque;true;descendingIterator;();;Element of Argument[-1];ReturnValue;value", |
There was a problem hiding this comment.
Other iterators seem to use Element of ...
There was a problem hiding this comment.
yeah, that's a typo...
smowton
left a comment
There was a problem hiding this comment.
Reviewed the rest.
I guess now is the time to consider whether all these parallel routings of MapKeys and MapValues is enough to make us wish maps were composed of Elements which themselves have Keys and Values.
| "java.util;Arrays;false;fill;(char[],char);;Argument[1];ArrayElement of Argument[0];value", | ||
| "java.util;Arrays;false;fill;(short[],int,int,short);;Argument[3];ArrayElement of Argument[0];value", | ||
| "java.util;Arrays;false;fill;(short[],short);;Argument[1];ArrayElement of Argument[0];value", | ||
| "java.util;Arrays;false;fill;(int[],int,int,int);;Argument[3];ArrayElement of Argument[0];value", |
There was a problem hiding this comment.
I note lots of primitive propagation steps that are ruled out elsewhere?
There was a problem hiding this comment.
Where are they ruled out?
There was a problem hiding this comment.
I thought these would be ruled out by the numeric-type exclusions discussed in the other thread, but perhaps not?
There was a problem hiding this comment.
No. The bulk of this work is about adding value-flow through arrays and collections by modelling the steps as proper read- and store steps (aka as field flow). This means that the PR removes those steps from the default set of taint steps, and instead adds them as general read- and store steps that use the synthetic field ArrayElement. The array read step with the type exclusions in TaintTrackingUtil.qll is a taint step and not a read step.
|
Jdk11 has still not finished its query evaluations, so performance is likely not looking very good. |
Looks like I accidentally wrote a big cartesian product. Oops. |
ac5a157 to
febc063
Compare
|
The full combination of precise collection read/store steps and backwards compatible taint steps appears to be too costly. Let's see if a suitable middle ground works. |
9312507 to
d828e24
Compare
|
Config-dependent store-as-taint: https://jenkins.internal.semmle.com/job/Changes/job/Java-Differences/1388/ |
|
Test tentative performance tweak: https://jenkins.internal.semmle.com/job/Changes/job/Java-Differences/1403/ |
bb377e3 to
cd87559
Compare
|
I've btw. checked the fixed results on geode. They are of the following form: I've left out the irrelevant parts (some paths are > 100 steps, and the above steps are separated by many calls). With collection flow as taint we get this sort of FP flow, but with precise collection flow we are able to prune the path based on type information, as we are able to retain the type |
Nice! |
cd87559 to
e21063a
Compare
| private import semmle.code.java.dataflow.TaintTracking2 | ||
|
|
||
| private class StoreTaintConfig extends TaintTracking::Configuration { | ||
| StoreTaintConfig() { this instanceof TaintTracking::Configuration or none() } |
There was a problem hiding this comment.
This is somewhat unusual, could you explain what's going on here? How does this not cause collisions with other implementations of TaintTracking::Configuration actually used in the query?
There was a problem hiding this comment.
This is indeed unusual 😉 And yes, it collides with every other configuration. It's a hack to globally modify all other configurations with code that is itself configuration-dependent.
There was a problem hiding this comment.
Why use this vs implementing AdditionalTaintStep? Isn't the API for isAdditionalTaintStep exactly the same and doesn't creating implementations of AdditionalTaintStep have exactly the same effect?
There was a problem hiding this comment.
No. The Configuration::isAdditionalTaintStep predicate is configuration-specific, whereas AdditionalTaintStep is configuration-independent (i.e. global and applies to all configs).
There was a problem hiding this comment.
Why would this collection data flow only be applied to TaintTracking and TaintTracking2? Why not all of them?
There was a problem hiding this comment.
There are only two copies TaintTracking at the moment, so this should be all of them.
There was a problem hiding this comment.
Besides, this is a temporary measure, since it is quite costly in terms of performance.
c5d8278 to
fc913e7
Compare
cb1f31a to
1c081ee
Compare
| containerUpdateStep(n1, n2) | ||
| } | ||
|
|
||
| predicate arrayStoreStep(Node node1, Node node2) { |
There was a problem hiding this comment.
Hmm, why wasn't this caught by the PR check?
| ) | ||
| } | ||
|
|
||
| predicate arrayReadStep(Node node1, Node node2, Type elemType) { |
| ) | ||
| } | ||
|
|
||
| predicate collectionReadStep(Node node1, Node node2) { |
|
|
||
| import Cached | ||
|
|
||
| private module StoreTaintSteps { |
There was a problem hiding this comment.
Document this for the benefit of external readers who ought to know this is a stop-gap / temporary measure (and what it intends to do).
This reverts commit 1c081ee.
This converts all collection and array flow steps to use precise value flow rather than taint flow. This means that read-steps must match up with preceding store-steps, and that the tracked object can be followed more precisely through collections and arrays, both in terms of its type and in terms of the flow being applicable in pure
DataFlow::Configurations. This allows more flow for value-based data flow, which is relevant for bothDataFlow::Configurations and the subpaths ofTaintTracking::Configurationthat are restricted to value flow (i.e. when tracking through fields), and more precision in taint tracking compared to simply marking a collection or array as tainted.As an example, FP flow like the following is no longer reported:
Collection and array read steps are still also treated as taint steps to support a suitable interpretation of a "tainted collection" or "tainted array".
In addition, a temporary measure to provide a large degree of backwards compatibility with ad-hoc taint flows, a set of configuration-dependent store-as-taint steps are added to support sinks and steps that read from collections/arrays.